Optimize the use of xml_parent() vs. xml_find_all("parent::*") by MichaelChirico · Pull Request #3011 · r-lib/lintr

MichaelChirico · 2026-01-26T06:31:14Z

Closes #2497. Written with Gemini. The results were pretty strongly against using xml_parent() for "typical" expressions {lintr} will encounter, so I also went ahead and refactored existing usages into XPath equivalents.

I've folded a summary of the dialog and a few full benchmarking scripts in Details below.

Details

Digest of Gemini dialog

  We evaluated replacing pure XPath parent jumps (/parent::*) with the xml2::xml_parent() C-pointer helper across various tree sizes (up to 10,000
  expressions).

  1. The Win: Full-Tree Initial Scans
   * Target: R/source_utils.R (Function call cache initialization).
   * Change: xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")).
   * Finding: ~42% speedup (275ms vs. 159ms).
   * Reasoning: For massive full-tree scans of highly frequent nodes, xml_parent() avoids the overhead of the XPath engine's axis evaluation for
     every single match. This is the only pattern where the optimization was consistently beneficial.

  2. The Regressions: Nodesets and Hybrid Filtering
   * Target: Linters like duplicate_argument_linter and missing_package_linter.
   * Change: Jumping to a parent and then filtering in R (e.g., via vapply) or with a secondary self::expr[...] XPath.
   * Finding: ~87% to 130% slower.
   * Reasoning: We encountered the "R-to-C Wall." Transitioning from R to the C engine multiple times—plus the overhead of R-level loops—is significantly more expensive than a single, complex C-level XPath evaluation. Pure XPath handles parent jumps and predicates in a single pass.

  3. The Regressions: Jumps on Existing Nodesets
   * Target: Linters where we already have a calculated nodeset (e.g., implicit_integer_linter).
   * Change: xml_parent(nodes) vs. xml_find_all(nodes, "parent::*").
   * Finding: ~45% slower.
   * Reasoning: Once a nodeset is already in R memory, the XPath engine is highly optimized for relative axis jumps. The overhead of calling the R helper function xml_parent() actually introduces a regression compared to a relative XPath string.

For R/source_utils.R

library(lintr)
library(xml2)
library(microbenchmark)

# Generate a file with 10,000 function calls to emphasize full-tree scan costs
generate_large_file <- function(n_calls = 10000) {
  lines <- vapply(1:n_calls, function(i) sprintf("f%d(a = %d, b = %d)", i, i, i), character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(10000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content

bench_orig <- function() {
  res <- xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr")
  length(res)
}

bench_optimized <- function() {
  res <- xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
  length(res)
}

cat("Benchmark (1) R/source_utils.R (Full-tree Cache Initialization):\n")
print(microbenchmark(orig = bench_orig(), optimized = bench_optimized(), times = 50))
unlink(temp)

For R/duplicate_argument_linter.R

library(lintr)
library(xml2)
library(glue)
library(microbenchmark)

# Setup
except <- c("mutate", "transmute")
ns_calls <- lintr:::xp_text_in_table(except)
xpath_orig <- glue("//EQ_SUB[not(preceding-sibling::expr/SYMBOL_FUNCTION_CALL[{ ns_calls }])]/parent::expr[count(EQ_SUB) > 1]")
xpath_hybrid_base <- glue("//EQ_SUB[not(preceding-sibling::expr/SYMBOL_FUNCTION_CALL[{ ns_calls }])]")

generate_large_file <- function(n_calls = 2000) {
  lines <- vapply(1:n_calls, function(i) {
    if (i %% 10 == 0) sprintf("list(a%d = 1, a%d = 2)", i, i)
    else sprintf("list(a%d = 1, b%d = 2)", i, i)
  }, character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(2000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content

bench_orig <- function() {
  call_expr <- xml_find_all(xml, xpath_orig)
  length(call_expr)
}

bench_hybrid <- function() {
  # Jump to parents then filter in R
  call_expr <- xml_parent(xml_find_all(xml, xpath_hybrid_base))
  call_expr <- call_expr[vapply(call_expr, function(x) length(xml_find_all(x, "EQ_SUB")) > 1L, logical(1L))]
  length(call_expr)
}

cat("\nBenchmark (2) R/duplicate_argument_linter.R (Hybrid vs. Pure XPath):\n")
print(microbenchmark(orig = bench_orig(), hybrid = bench_hybrid(), times = 50))
unlink(temp)

For R/implicit_integer_linter.R

library(lintr)
library(xml2)
library(microbenchmark)

generate_large_file <- function(n = 5000) {
  lines <- vapply(1:n, function(i) sprintf("x <- -%d", i), character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(5000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content
nums <- xml_find_all(xml, "//NUM_CONST")

bench_orig <- function() {
  # Pure XPath relative jump
  is_negative <- !is.na(xml_find_first(nums, "parent::expr/preceding-sibling::OP-MINUS"))
}

bench_optimized <- function() {
  # xml_parent helper jump
  is_negative <- !is.na(xml_find_first(xml_parent(nums), "preceding-sibling::OP-MINUS"))
}

cat("\nBenchmark (3) R/implicit_integer_linter.R (Nodeset Jump: helper vs. XPath):\n")
print(microbenchmark(orig = bench_orig(), optimized = bench_optimized(), times = 100))
unlink(temp)

codecov · 2026-01-26T06:34:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.23%. Comparing base (e6b2834) to head (d98dc21).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3011   +/-   ##
=======================================
  Coverage   99.23%   99.23%           
=======================================
  Files         128      128           
  Lines        7288     7290    +2     
=======================================
+ Hits         7232     7234    +2     
  Misses         56       56

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

MichaelChirico added 3 commits January 26, 2026 06:02

minimal diff

03c824c

back off of xml_parent() in other cases

618bde7

simplify diff

d98dc21

MichaelChirico changed the title ~~Use xml_parent() in build_xml_find_function_calls()~~ Optimize the use of xml_parent() vs. xml_find_all("parent::*") Jan 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize the use of xml_parent() vs. xml_find_all("parent::*")#3011

Optimize the use of xml_parent() vs. xml_find_all("parent::*")#3011
MichaelChirico wants to merge 3 commits intomainfrom
xml-parent

MichaelChirico commented Jan 26, 2026

Uh oh!

codecov bot commented Jan 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MichaelChirico commented Jan 26, 2026

Digest of Gemini dialog

For R/source_utils.R

For R/duplicate_argument_linter.R

For R/implicit_integer_linter.R

Uh oh!

codecov bot commented Jan 26, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant